Lesson 11: groupby

Author

최규빈

Published

July 25, 2023

강의영상

imports

import numpy as np
import pandas as pd

df.groupby

- 예제1: 아래의 예제에서 그룹별 평균을 구하여라.

_category = ['A']*5+['B']*5
_value = np.concatenate([np.random.randn(5), np.random.randn(5)+10])
df = pd.DataFrame({'category':_category, 'value':_value})
df

	category	value
0	A	-1.115880
1	A	1.109800
2	A	0.364589
3	A	0.518465
4	A	1.571603
5	B	8.883637
6	B	12.943322
7	B	8.463850
8	B	10.164719
9	B	10.685973

- 방법1: groupby를 이용하지 않는 방법들

# 이렇게 해도 되고 
df[df.category == 'A'].value.mean(), df[df.category == 'B'].value.mean()

(0.4897153568753304, 10.228300113047595)

# 이것도 가능
{s:df[df.category == s].value.mean() for s in set(df.category)}

{'A': 0.4897153568753304, 'B': 10.228300113047595}

- 방법2: groupby만 사용

grouped_list = list(df.groupby('category'))
grouped_list

[('A',
    category     value
  0        A -1.115880
  1        A  1.109800
  2        A  0.364589
  3        A  0.518465
  4        A  1.571603),
 ('B',
    category      value
  5        B   8.883637
  6        B  12.943322
  7        B   8.463850
  8        B  10.164719
  9        B  10.685973)]

bundle1, bundle2 = grouped_list

group_name1, sub_df1 = bundle1

display(sub_df1)

	category	value
0	A	-1.115880
1	A	1.109800
2	A	0.364589
3	A	0.518465
4	A	1.571603

for group_name, sub_df in df.groupby('category'):
    print(group_name)
    display(sub_df)

A
B

	category	value
0	A	-1.115880
1	A	1.109800
2	A	0.364589
3	A	0.518465
4	A	1.571603

	category	value
5	B	8.883637
6	B	12.943322
7	B	8.463850
8	B	10.164719
9	B	10.685973

{group_name:sub_df.value.mean() for group_name, sub_df in df.groupby('category')}

{'A': 0.4897153568753304, 'B': 10.228300113047595}

- 방법3: groupby().aggregate()

grouped = df.groupby('category')
grouped.aggregate(np.mean)

	value
category
A	0.489715
B	10.228300

df.groupby(['category']).aggregate(np.mean)

	value
category
A	0.489715
B	10.228300

groupby(?)에서 올 수 있는 구조

열의이름
[열의이름,열의이름]

aggregate(?)에서 올 수 있는 구조

함수: 함수자체¹가 오거나, 함수를 의미하는 문자열² 이 올 수 있음.
리스트: [함수, 함수] # 여기에서 함수자리에는 함수자체, 혹은 함수문자열 아무것이나 올 수 있음.
딕셔너리1: {열의이름:함수}
딕셔너리2: {열의이름:[함수,함수]}

¹ np.mean,sum

² ‘size’, ‘count’, ‘sum’, ‘mean’, ‘median’, ‘min’, ‘max’, ‘std’, ‘var’

df.groupby(['category']).aggregate({'value':[np.mean,'size']})

	value
	mean	size
category
A	0.489715	5
B	10.228300	5

pd.cut + df.groupby

- 구간별 count

arr = np.random.rand(1000)

_bin = [0, 0.2, 0.5, 0.9, 1]
_bin

[0, 0.2, 0.5, 0.9, 1]

- [0,0.2,0.5,0.9,1] 구간에 몇개의 숫자들이 있는지 알고 싶고, 구간별 평균도 알고싶다.

pd.DataFrame({'cat':pd.cut(arr,_bin), 'arr':arr}).groupby('cat').aggregate({'arr':['size',np.mean]})

	arr
	size	mean
cat
(0.0, 0.2]	189	0.099927
(0.2, 0.5]	309	0.356922
(0.5, 0.9]	407	0.702618
(0.9, 1.0]	95	0.949145

flights data

df=pd.read_csv('https://raw.githubusercontent.com/PacktPublishing/Pandas-Cookbook/master/data/flights.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58492 entries, 0 to 58491
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   MONTH      58492 non-null  int64  
 1   DAY        58492 non-null  int64  
 2   WEEKDAY    58492 non-null  int64  
 3   AIRLINE    58492 non-null  object 
 4   ORG_AIR    58492 non-null  object 
 5   DEST_AIR   58492 non-null  object 
 6   SCHED_DEP  58492 non-null  int64  
 7   DEP_DELAY  57659 non-null  float64
 8   AIR_TIME   57474 non-null  float64
 9   DIST       58492 non-null  int64  
 10  SCHED_ARR  58492 non-null  int64  
 11  ARR_DELAY  57474 non-null  float64
 12  DIVERTED   58492 non-null  int64  
 13  CANCELLED  58492 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 6.2+ MB

- 예제1: 항공사(AIRLINE)별로 도착지연시간의(ARR_DELAY)의 평균을 구하라.

df.groupby('AIRLINE').aggregate({'ARR_DELAY':'mean'})

	ARR_DELAY
AIRLINE
AA	5.542661
AS	-0.833333
B6	8.692593
DL	0.339691
EV	7.034580
F9	13.630651
HA	4.972973
MQ	6.860591
NK	18.436070
OO	7.593463
UA	7.765755
US	1.681105
VX	5.348884
WN	6.397353

- 예제2: 항공사(AIRLINE)별로 비행취소건수(CANCELLED)의 합계를 구하라. 취소건수가 가장 높은 두개의 항공사는 어디인가?

df.groupby('AIRLINE').aggregate({'CANCELLED':'sum'})

	CANCELLED
AIRLINE
AA	154
AS	0
B6	1
DL	38
EV	146
F9	10
HA	0
MQ	152
NK	25
OO	142
UA	93
US	21
VX	6
WN	93

- 예제3: 항공사(AIRLINE)별로 비행취소율(CANCELLED)을 구하라. 비행취소율이 가장 높은 두개의 항공사는 어디인가?

df.groupby('AIRLINE').aggregate({'CANCELLED':'mean'})
# df.groupby('AIRLINE').aggregate({'CANCELLED':lambda x : sum(x)/len(x)})

	CANCELLED
AIRLINE
AA	0.017303
AS	0.000000
B6	0.001842
DL	0.003585
EV	0.024923
F9	0.007593
HA	0.000000
MQ	0.043791
NK	0.016491
OO	0.021554
UA	0.011935
US	0.013003
VX	0.006042
WN	0.011048

- 예제4: 비행취소율이 가장 높은 두개의 항공사(AIRLINE)를 선택하라. 그 두 항공사에 대하여 요일별(WEEKDAY) 비행취소율(CANCELLED)을 조사하라.

df.groupby('AIRLINE').aggregate({'CANCELLED':'mean'})

	CANCELLED
AIRLINE
AA	0.017303
AS	0.000000
B6	0.001842
DL	0.003585
EV	0.024923
F9	0.007593
HA	0.000000
MQ	0.043791
NK	0.016491
OO	0.021554
UA	0.011935
US	0.013003
VX	0.006042
WN	0.011048

MQ와 EV

- 예제5: 아래는 운행거리의 요약통계량이다.

df.DIST.describe()

count    58492.000000
mean       872.900072
std        624.996805
min         67.000000
25%        391.000000
50%        690.000000
75%       1199.000000
max       4502.000000
Name: DIST, dtype: float64

운행거리를 구간별로 [-np.inf,391,690,1199,np.inf]와 같이 나눈뒤 비행취소건수와 취소율을 구하여라.

_bin = [-np.inf,391,690,1199,np.inf]
df.assign(DIST2 = pd.cut(df.DIST,_bin)).groupby('DIST2').aggregate({'CANCELLED':['mean','sum']})

	CANCELLED
	mean	sum
DIST2
(-inf, 391.0]	0.022659	334
(391.0, 690.0]	0.013503	196
(690.0, 1199.0]	0.013637	203
(1199.0, inf]	0.010313	148

Quiz: HRDataset_v14 자료분석

아래의 코드를 활용하여 Kaggle의 HRdataset을 불러오라.

df = pd.read_csv('https://raw.githubusercontent.com/guebin/DV2022/master/posts/HRDataset_v14.csv')
df

	Employee_Name	EmpID	MarriedID	MaritalStatusID	GenderID	EmpStatusID	DeptID	PerfScoreID	FromDiversityJobFairID	Salary	...	ManagerName	ManagerID	RecruitmentSource	PerformanceScore	EngagementSurvey	EmpSatisfaction	SpecialProjectsCount	LastPerformanceReview_Date	DaysLateLast30	Absences
0	Adinolfi, Wilson K	10026	0	0	1	1	5	4	0	62506	...	Michael Albert	22.0	LinkedIn	Exceeds	4.60	5	0	1/17/2019	0	1
1	Ait Sidi, Karthikeyan	10084	1	1	1	5	3	3	0	104437	...	Simon Roup	4.0	Indeed	Fully Meets	4.96	3	6	2/24/2016	0	17
2	Akinkuolie, Sarah	10196	1	1	0	5	5	3	0	64955	...	Kissy Sullivan	20.0	LinkedIn	Fully Meets	3.02	3	0	5/15/2012	0	3
3	Alagbe,Trina	10088	1	1	0	1	5	3	0	64991	...	Elijiah Gray	16.0	Indeed	Fully Meets	4.84	5	0	1/3/2019	0	15
4	Anderson, Carol	10069	0	2	0	5	5	3	0	50825	...	Webster Butler	39.0	Google Search	Fully Meets	5.00	4	0	2/1/2016	0	2
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
306	Woodson, Jason	10135	0	0	1	1	5	3	0	65893	...	Kissy Sullivan	20.0	LinkedIn	Fully Meets	4.07	4	0	2/28/2019	0	13
307	Ybarra, Catherine	10301	0	0	0	5	5	1	0	48513	...	Brannon Miller	12.0	Google Search	PIP	3.20	2	0	9/2/2015	5	4
308	Zamora, Jennifer	10010	0	0	0	1	3	4	0	220450	...	Janet King	2.0	Employee Referral	Exceeds	4.60	5	6	2/21/2019	0	16
309	Zhou, Julia	10043	0	0	0	1	3	3	0	89292	...	Simon Roup	4.0	Employee Referral	Fully Meets	5.00	3	5	2/1/2019	0	11
310	Zima, Colleen	10271	0	4	0	1	5	3	0	45046	...	David Stanley	14.0	LinkedIn	Fully Meets	4.50	5	0	1/30/2019	0	2

311 rows × 36 columns

다음을 읽고 참 거짓을 판단하여라.

1. 근무인원수가 가장 많은 인종(RaceDesc)은 ’White’이며 이는 ’Asian’인종과 ’Black or African American’의 합보다 많다.

2. ’RaceDesc==White’의 성별(Sex)임금차이는 2000이상이다.

3. 퇴직한사람(Termd==1)은 모두 104명이며 백인여성의 퇴직자수가 가장 많다.

4. 퇴직한사람중 아시아인의 비율은 10%가 넘지 않는다.

아래를 조사하라.

5. [성별(Sex), 결혼유무(MarriedID)] 별 연봉(Salary)의 평균을 조사하라. 어떠한 그룹이 가장 평균연봉이 적은가?

6. [성별(Sex), 인종별(RaceDesc)] 별 연봉(Salary)의 중앙값(‘median’)을 조사하라. 연봉의 중앙값이 가장 높은 그룹은 무엇인가?

7. [성별(Sex), 인종별(RaceDesc)] 별 연봉(Salary)의 중앙값(‘median’)과 근무인원수를 함께 조사하라. 연봉의 중앙값이 가장 높은 그룹이 혜택을 받는 그룹이라고 느껴지는가?

!git add . 
!git commit -m .
!git push
!quarto publish --no-browser --no-prompt

[main 7780095] .
 2 files changed, 1186 insertions(+), 6 deletions(-)
Enumerating objects: 13, done.
Counting objects: 100% (13/13), done.
Delta compression using up to 16 threads
Compressing objects: 100% (7/7), done.
Writing objects: 100% (7/7), 2.07 KiB | 2.07 MiB/s, done.
Total 7 (delta 4), reused 0 (delta 0)
remote: Resolving deltas: 100% (4/4), completed with 4 local objects.
To https://github.com/guebin/PP2023SUM.git
   a8c13c7..7780095  main -> main
From https://github.com/guebin/PP2023SUM
 * branch            gh-pages   -> FETCH_HEAD
Rendering for publish:

[ 1/27] 1_1.qmd
[ 2/27] about.qmd
[ 3/27] 1_2.qmd
[ 4/27] 2_2.qmd
[ 5/27] posts/Day3/08_PandasBackend/ls21.ipynb
[ 6/27] posts/Day3/08_PandasBackend/ls20.ipynb
[ 7/27] posts/Day3/07_ggplot2/ls16.ipynb
[ 8/27] posts/Day3/07_ggplot2/ls19.ipynb
[ 9/27] posts/Day3/07_ggplot2/ls18.ipynb
[10/27] posts/Day3/07_ggplot2/ls17.ipynb
[11/27] posts/Day2/03_Pandas/ls07.ipynb
[12/27] posts/Day2/03_Pandas/ls08.ipynb
[13/27] posts/Day2/03_Pandas/ls09.ipynb
[14/27] posts/Day2/04_Function, Iteration/ls10.ipynb
[15/27] posts/Day2/04_Function, Iteration/ls11.ipynb
[16/27] posts/Day2/05_Modulesls12.ipynb
[17/27] posts/Day1/01_PythonBasicSyntax/ls02.ipynb
[18/27] posts/Day1/01_PythonBasicSyntax/ls04.ipynb
[19/27] posts/Day1/01_PythonBasicSyntax/ls01.ipynb
[20/27] posts/Day1/01_PythonBasicSyntax/ls03.ipynb
[21/27] posts/Day1/02_Numpy/ls05.ipynb
[22/27] posts/Day1/02_Numpy/ls06.ipynb
[23/27] 3_2.qmd
[24/27] 3_1.qmd
[25/27] index.qmd
[26/27] 2_3.qmd
[27/27] 2_1.qmd

Preparing worktree (resetting branch 'gh-pages'; was at e9a8acb)
Branch 'gh-pages' set up to track remote branch 'gh-pages' from 'origin'.
HEAD is now at e9a8acb Built site for gh-pages
[gh-pages 6edb56e] Built site for gh-pages
 24 files changed, 1144 insertions(+), 687 deletions(-)
origin  https://github.com/guebin/PP2023SUM.git (fetch)
origin  https://github.com/guebin/PP2023SUM.git (push)
To https://github.com/guebin/PP2023SUM.git
   e9a8acb..6edb56e  HEAD -> gh-pages

NOTE: GitHub Pages sites use caching so you might need to click the refresh
button within your web browser to see changes after deployment.

[✓] Published to https://guebin.github.io/PP2023SUM/

NOTE: GitHub Pages deployments normally take a few minutes (your site updates
will be visible once the deploy completes)